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Colouring and breaking sticks: random 
distributions and heterogeneous clustering 

Peter J. Green^, 



Abstract 

We begin by reviewing some probabilistic results about the Dirichlet 
Process and its close relatives, focussing on their implications for statist- 
ical modelling and analysis. We then introduce a class of simple mixture 
models in which clusters are of different 'colours', with statistical charac- 
teristics that are constant within colours, but different between colours. 
Thus cluster identities are exchangeable only within colours. The basic 
form of our model is a variant on the familiar Dirichlet process, and 
we find that much of the standard modelling and computational ma- 
chinery associated with the Dirichlet process may be readily adapted to 
our generalisation. The methodology is illustrated with an application 
to the partially-parametric clustering of gene expression profiles. 

Keywords Bayesian nonparametrics, gene expression profiles, hierarch- 
ical models, loss functions, MCMC samplers, optimal clustering, parti- 
tion models, Polya urn, stick breaking 
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1 Introduction 

The purpose of this note is four-fold: to remind some Bayesian nonpara- 
metricians gently that closer study of some probabilistic literature might 
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be rewarded, to encourage probabilists to think that there are statistical 
modelling problems worth of their attention, to point out to all another 
important connection between the work of John Kingman and modern 
statistical methodology (the role of the coalescent in population genetics 
approaches to statistical genomics being the most important example; 
see papers by Donnelly, Ewens and Griffiths in this volume), and finally 
to introduce a modest generalisation of the Dirichlet process. 

The most satisfying basis for statistical clustering of items of data is 
a probabilistic model, which usually takes the form of a mixture model, 
broadly interpreted. In most cases, the statistical characteristics of each 
cluster or mixture component are the same, so that cluster identities are 
a priori exchangeable. In Section [5] we will introduce a class of simple 
mixture models in which clusters are of different categories, or colours 
as we shall call them, with statistical characteristics that are constant 
within colours, but different between colours. Thus cluster identities are 
exchangeable only within colours. 



2 Mixture models and the Dirichlet process 



Many statistical models have the following character. Data {Yi} are 
available on n units that we shall call items, indexed i = 1, 2, n. 
There may be item-specific covariates, and other information, and the 
distribution of each Yi is determined by an unknown parameter <pi £ fl, 
where we will take f2 here to be a subset of a Euclidean space. Apart from 
the covariates, the items are considered to be exchangeable, so we assume 
the {Yi} are conditionally independent given {</>i}, and model the {4>i} 
as exchangeable random variables. Omitting covariates for simplicity, we 
write Yi\(f>i ~ f 

It is natural to take {4>i} to be independent and identically distrib- 
uted random variables, with common distribution G, where G itself is 
unknown, and treated as random. We might be led to this assumption 
whether we ar e think ing of a de Finetti- s tyle r ep resentation t heore m 
jFinettil |l93ll Il937h : see also iKingmanl d 19781) iKallenbersJ (20051)) 



or by follow i ng hi erarchical modelling principles (|Gelman et al.l . 11995; 
Green et all l2003l) . Thus, unconditionally, Y t \G ~ / f(-\(/))G(d(p), inde- 
pendently given G. 

This kind of formulation enables us to borrow strength across the units 
in inference about unknown parameters, with the aim of controlling the 
degrees of freedom, capturing the idea that while the {4>i} may be dif- 
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ferent from item to item, we nevertheless understand that, through ex- 
changeability, knowing the value of one of them would tell us something 
about the others. 

There are still several options. One is to follow a standard paramet- 
ric formulation, and to assume a specific parametric form for G, with 
parameters, or rather 'hyperparameters', in turn given a hyperprior dis- 
tribution. However, many would argue that in most practical contexts, 
we would have little information to build such a model for G, which 
represents variation in the population of possible items of the parameter 
<f> that determines the distribution of the data Y. 

Thus we would be led to consider more flexible models, and one of 
several approaches might occur to us: 

• a nonparametric approach, modelling uncertainty about G without 
making parametric assumptions; 

• a mixture model representation for G; 

• a partition model, where the {4>i} are grouped together, in a way 
determined a posteriori by the data. 

One of the things we will find, below, is that taking natural choices 
in each of these approaches can lead to closely related formulations in 
the end, so long as both modelling and inference depend solely on the 
{<pi}. These connections, not novel but not entirely well-known either, 
shed some light on the nature and implications of the different modelling 
approaches. 



2.1 Ferguson definition of the Dirichlet process 



Much Bayesian nonparametric distributi onal modelling ((Walker et al 



1999) begins with the Dirichlet process ([Ferguson! . 119731 ). Building on 
earlier work by Dubins, Freedman and Fabius, Ferguson intended this 
model to provide a nonparametric prior model for G with a large sup- 
port, yet one remaining capable of tractable prior-to-posterior analysis. 

Given a probability distribution Go on an arbitrary measure space f2, 
and a positive real 9, we say the random distribution G on f2 follows a 
Dirichlet process, 

G~DP(8,G Q ), 

if for all partitions fi = [J™^ Bj (Bj n B^ = if j ^ k), and for all m, 



(G(Bi), . . . , G(B m )) ~ Dirichlet(0G o (Bi), • . .,0G o {B m )), 
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where Dirichlet(ai, a 2) ■ • ■ , a m ) denotes the distribution on the m-dim- 
ensional simplex with density at (xi,X2, ■ ■ ■ ,x m ) proportional to IIj=i 

a-j—l 

The base measure Go gives the expectation of G: 

E(G(B)) = G (B). 
Even if Go is continuous, G is a.s. d iscrete ( Kingmanl . 1967 : Ferguson! 



1973tlBlackwell . ll973l:lKingmanlll975l) . so i.i.d. draws {<f>i,i = 1,2, ...,n} 



from G exhibit ties. The parameter 9 measures (inverse) concentration: 
given i.i.d. draws {fa, i = 1, 2, . . . , n} from G, 

• as 9 — > 0, all fa are equal, a single draw from Go; 

• as 9 — ¥ oo, the fa are drawn i.i.d. from Gq. 



2.2 The stick-breaking construction 



A draw G from a Dirichlct process is a discrete distribution on f2, so an 
alternative way to define the Dirichlet process would be via a construc- 
tion of such a random distribution, through specification of the joint 
distribution of the locations of the atoms, an d their probabilities. Such 
a construction was given by I Ferguson! ([1973): in this, the locations are 
i.i.d. draws from Go, with probabilities forming a decreasing sequence 
constructed from increments of a gamma process. 

This is not the explicit construction that is most commonly used 
today, which is that known in the Baye sian nonparametric community 
as Sethuraman's s tick-breaking model ( Sethuraman and Tiwaril 19821 : 
Sethuramanl . ll994r ). This leads to this algorithm for generating the {fa}: 



1. draw <j)j 

2. draw Vj 



Go, i.i.d. 

Beta(l,0 



3 = 1, 2, 
, i.i.d., j -- 



1, 2, 



3. define G to be the discrete distribution putting probability (1— Vi)(l — 
V 2 )...(l-V j . 1 )V j on$; 

4. draw fa i.i.d. from G, i = 1, 2, . . . , n. 



This construction can be found considerably earlier in the probabil- 
ity literature, especially in connectio n with models for species sampling. 
The earliest reference seems to be in McCloskey (1965 ); for more readily 
access ible sources, see Patil and Tailliei ( 1977 ) and Donnelly and Jovce 
(1989). where it is described in the context of size-biased sampling and 
the GEM (Generalised Engen-McCloskey) distributions. See also Sec- 
tion 13.31 below. 
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2.3 Limits of finite mixtures 



A more direct, classical approach to modelling the distribution of Y in 
a flexible way would be to use a finite mixture model. Suppose that Yi 
are i.i.d. with density Wjfa(-\(j)*) for a prescribed parametric density 
family /o( - |</>), and consider a Bayesian formulation with priors on the 
component weights {u)j} and the component-specific param eters {4>*j}- 
The simplest formulation (e.g. iRichardson and Greenl (|1997h ) uses a Di- 
richlet prior on the weights, and takes the {4>j} to be i.i.d. a priori, but 
with arbitrary distribution, so in algorithmic form: 



1. draw (wi, W2, ■ ■ ■ , wu) ~ Dirichlet(<5, . . . 

2. draw c t £ {1, 2, . . . , k} with P{a = j} 

3. draw </>* ~ Go, i.i.d., j = 1, . . . , k; 

4. set d>i = <£*. . 



= M), 



i.i.d. 



1, .. 



It is well known that if we take the limit k — > oo, S — > such that 
kS — > 9, then the joint distribution of the {4>i} is the same as that ob- 
tained via the D irichlet process formulation in the previous subsections 
(see for example iGreen and Richardson! (120011)1. This result i s actually a 
corollary of a much stronger statement due to Kingman ( 19751) , about the 
convergence of di screte probability measur es. F or more recent results in 
this di rection see lMuliere and Secchil (|2003f) and llshwaran and Zarepour 
(|2002l) . 

We are still using the formulation Yi\G ~ J f(-\4>)G(d(j)), independ- 
ently given G, but note that G is invisible in this view; it has implicitly 
been integrated out. 



2.4 Partition distribution 

Suppose that, as above, G is drawn from DP(8, Go), and then {<fo : i = 
1,2, ... ,n} drawn i.i.d. from G. We can exploit the conjugacy of the 
Dirichlet with respect to multinomial sampling to integrate out G. For 
a fixed partition {Bj}JL 1 of fi, and integers Cj £ {1,2,..., m}, we can 
write 

T{6) ^neGoiB^+n,) 



P{*6B* 1 < = l > 2 f ...,n} = = ?5 y- T n 



ne + n)j- = \ r(flGo(B 3 -)) ' 

where n.j = =tf={i : Ci — j}. The jth factor in the product above is 1 if 
n, } = 0, and otherwise 9Go(B j ){9G (B j )+l)(9G (B j )+2) . . . (9G {Bj)+ 
rij — X), so we find that if the partition becomes increasingly refined, and 
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Go is non-atomic, then the joint distribution of the {fa} can equivalently 
be described by 

1. partitioning {1, 2, . . . , n} = Uj=i @j a ^ random, so that 



P(C 1 ,C 2 



,C d ) 



r(6 + n) 



^n^-l)! (2.1) 



j=i 



where rij = #Cj; 



2. drawing 4>* ~ Go, i.i.d., j = 1, 

3. setting fa = fa- if i e Cj. 



. , d, and then 



Note that the partition model (I2.1j) shows extreme preference for un- 
equal cluster sizes. If we let a r = =ff{j : rij = r}, then the joint distribu- 
tion of (ai, 02, . . .) is 



1 



m\n 2 \ ■■■n d \ 



xp(C 1 ,C 2 ,...,C d ). 



(2.2) 



This is equation (A3) of lEwensI (j 19721 ) . derived in a context where rij is 



the number of genes in a sample of the jth allelic type, in sampling from 
a selectively neutral population process. The first factor in (|2.2I) is the 
multinomial coefficient accounting for the number of ways the n items 
can be allocated to clusters of the required sizes, and the second factor 
accounts for the different sets of {n%, ri2 ■ ■ ■ , rid} leading to the same 
(ai, ct2, . . .). Multiplying all this together, a little manipulation leads to 
the familiar Ewens sampling formula: 



n!r(fl) Tj 
p(a 1) a 2 ,...) = f ^ T ^H- 



(2.3) 



See also Kingman! (|1993l ). page 97. 

This representation of the partit i on str ucture implied by the Dirichlet 
process was derived by lAntoniakl |l97J) , in the form f|2 . 3[) . He noted 
that a consequence of this representation is that the joint distribution 
of the {4>i} given d is independent for 9; thus given ob served \ <j)j\, d is 
sufficient for 8. A similar observation was also made by lEwensI (|1972l ) in 
the genetics context of his work. 

Note that as in the previous section, G has been integrated out, and 
so is invisible in this view of the Dirichlet process model. 
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2.5 Reprise 

Whichever of the points of view is taken, items are clustered, according 
to a tractable distribution parametrised by 6 > 0, and for each cluster 
the cluster-specific parameter <fi is an independent draw from Gq. Much 
statistical methodology built on the Dirichlet-process model uses only 
this joint distribution of the and so should hardly be called 'non- 

parametric'. Of course, even though G itself is invisible in two of the 
derivations above, the Dirichlet-process model does support inference 
about G, but this is seldom exploited in applications. 



2.6 Multiple notations for partitions 

In what follows, we will need to make use of different notations for the 
random partition induced by the Dirichlet-process model, or its relatives. 
We will variously use 

• c is a partition of {1, 2, . . . , n}; 

• clusters of partition are C% , C2, . . . , C& (d is the degree of the parti- 
tion): U U C i = {!. 2, • • • , «}, Cj n C r = if j ± f; 

• c is the allocation vector: c$ = j if and only if i G Cj. 

Note that the first of these makes no use of the (arbitrary) labelling 
of the clusters used in the second and third. We have to take care with 
multiplicities, and the distinction between (labelled) allocations and (un- 
labelled) partitions. 



3 Applications and generalisations 

3.1 Some applications of the Dirichlet process in 
Bayesian nonparametrics 

Lack of space precludes a thorough discussion of the huge statistical 
methodology literature exploiting the Dirichlet process in Bayesian non- 
parametric procedures, so we will only review a few highlights. 



Lol (|1984[ ) proposed density estimation procedures devised by mixing a 
user-defined kernel function K (y, u) with respect to a Dirichlet process; 
thus i.i.d. data {Yi} are assumed distributed as f K(-,u)G(du) with 
G drawn from a Dirichlet process. This is now known as the Dirichlet 
process mixture model (a better terminology than the formerly-used 
'mixture of Dirichlet processes'). The formulation is identical to that we 
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started with in Section [21 but for the implicit assumption that y and u 
lie in the same space, and that the kernel K(-,u) is a unimodal density 
located near u. 

In the 1990s there was a notable flourishing of applied Bayesian non- 
parametrics, stimulated by interest in the Dirichlet process, and the 
rapid increase in computational power available to researchers, allowing 
almost routine use of the Polya urn s ampler approac h (see Section 0} to 
posterior computati on. For example, Escobarl (|1994T ) re- visited the Nor- 
mal Means prob lem, West et al. ( 1994[) dis cussed regression and density 
estimation, and Escobar and Westl (119951) further developed Bayesian 
density estimation. Miiller et al. ( 19961) ingeniously exploited multivari- 
ate density estimation using Dirichlet process mixtures to perform Bayesian 
curve fitting of one margin on the others. 



3.2 Example: clustered linear models for gene 
expression profiles 

Let us consider a substantial and more specific application in some detail, 
to motivate the Dirichlet process (DP) set-up as a natural elaboration 
of a standard parametric Bayesian hierarchical model approach. 

A remarkable aspect of modern microbiology has been the dramatic 
development of novel high-throughput assays, capable of delivering very 
high dimensional quantitative data on the genetic characteristics of or- 
ganisms from biological samples. One such technology is the measure- 



ment of gene expression using Afiymetrix gene chips. In lLau and Green 



(|2007l) . we work with possibly replicated gene expression measures. The 



data are {Y isr }, indexed by 

• genes i = 1, 2, . . . , n, 

• conditions s = 1, 2, . . . , S, and 

• replicates r = 1, 2, . . . , R s . 

Typically R s is very small, S is much smaller than n, and the 'con- 
ditions' represent different subjects, different treatments, or different 
experimental settings. 

We suppose there is a /c-dimensional (k < S) covariate vector x s 
describing each condition, and model parametric dependence of Y on x; 
the focus of interest is on the pattern of variation in these gene-specific 
parameters across the assayed genes. 
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Although other variants are easily envisaged, we suppose here that 

Y isr ~ N(x' s P i ,T i ' 1 ), independently. 

Here = (/3j,Tj) £ lZ k+1 is a gene-specific parameter vector char- 
acterising the dependence of gene expression on the condition-specific 
covariates. A priori, the genes can be considered exchangeable, and a 
standard hierarchical formulation would model the {4>i} as i.i.d. draws 
from a parametric prior distribution G, say, whose (hyper)parameters 
have unknown values. This set-up allows borrowing of strength across 
genes in the interest of stability and efficiency of inference. 

The natural nonparametric counterpart to this would be to suppose 
instead that G, the distribution describing variation of <p across the 
population of genes, does not have prescribed parametric form, but is 
modelled as a random distribution from a 'nonparametric' prior such as 
the Dirichlet process, specifically 

G~DP(0,G o ). 

A consequence of this assumption, as we have seen, is that G is atomic, so 
that the genes will be clustered together into groups sharing a common 
value of <j). A posteriori we obtain a probabilistic clustering of the gene 
e xpression profiles . 



Lau and Greenl (|2007f ) take a standard normal-inverse Gamma model, 



so that <j) — (/?, t) ~ Go means 

T~T(a ,b ) and (3\t ~ N fe (m , (rto) -1 *). 

This is a conjugate set-up, so that (/3, r) can be integrated out in each 
cluster. This leads easily to explicit within-cluster parameter posteriors: 

r/|y~r(a J ,6 J ) ) 

^|r;,y~N fc (m i) (r;t i )- 1 ) ) 

where 

aj = a + l/2#{isr : a = j}, 
b, =b Q + l/2(Y Cj - X C] m )'{Xc ] t^X l C] )- 1 {Y C] - X C] m ), 
m, = (X' Cj X Cj +t I)- 1 (X' Cj Y Cj +t m ), 
tj = X'cjXcj +t a I. 

The marginal likelihoods p(lcj-) are multivariate t distributions. 
We continue this example later, in Sections 15.41 and 15.51 



10 Peter J. Green 

3.3 Generalisations of the Dirichlet process, and related 

models 

Viewed as a nonparametric model or as a basis for probabilistic cluster- 
ing, the Dirichlet process is simple but inflexible — a single real parameter 
9 controls both variation and concentration, for example. And although 
the space CI where the base measure Go lies and in which <f> lives can 
be rather general, it is essentially a model for 'univariate' variation and 
unable to handle in a flexible way, for example, time-series data. 



Driven both by such considerations of statistical modelling (jWalker et al 



1999h . or curious pursuit of more general mathematical results, the Di 



richlet process has proved a fertile starting point for numerous general- 
isations, and we touch on just a few of these here. 



The Poisso n— Dirichl e t dis tribution and its two-parameter gen- 
eralisation. Kingman ( 19751) observed and exploited the fact that the 
limiting behaviour of random discrete distributions could become non- 
trivial and accessible through permutation of the components to be in 
ranked (decreasing) order. The limit law is th e Poisson-Dirich let distri- 
bution, implicitly defined and later described (jKingmanl . Il993l page 98) 
a s 'rather less than us e r-frien dly ' . 



Donnelly and Jovcd (|l989j ) elucidated the role of both ranking and 



size-bias ed sampling i n esta blishing limit l a ws fo r random distributions; 
see also iHolstJ l|200ll) and Urratia et all |2003l . page 107). The two- 
parameter generalisation of the Poisson-D irichlet model was disc overed 
by Pitman and co-workers, see for example IPitman and Yor (1997). This 
has been a rich topic for probabilistic study to the present day; see 
chapters by Gnedin, Haulk and Pitman, and by Aldous in this volume. 
The simplest view to take of the two-parameter Poisson-Dirichlet model 
PD(a, 9) is to go back to stick-breaking (Section I2.2[) and replace the 



Bct a(l,fl) distribution f or the variables Vj there by Beta(l — a, 9 + ja) 



Ishwaran and Jamesl (|200lh have considered Bayesian statistical ap- 



plications of stick-breaking priors defined in this way, and implementa- 
tion of Gibbs sampling for computing posterior distributions. 



Dirichlet process relations in structured dependent models. 

Motivated by the need to build statistical models for structured data 
of various kinds, there has been a huge effort in generalising Dirichlet 
process models for such situations — indeed, there is now an 'xDP' for 
nearly every letter of the alphabet. 
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This has become a rich and sometimes confusing area; perha ps the 



most important current models are Dependent Dirichlct processes (jMacEachern , 
19991 MacEach ern et all 120011) . Order-based dependent Diric hlet pro- 



cesse s ([Griffin and Steell . 120061 ) . Hierarchical D irichlet processes ( Teh et al 
200G). and Kernel stick breaking processes ([Dunson and Park! 



2007). 



Many of the models are based on stick-breaking representations, but in 
which the atoms and/or the weights for the representations of different 
components of the process a re made dependent on each other, or on 
covariates. The new book by Hiort et al. ( 2010l ) provides an excellent 
introduction and review of these developments. 



Polya trees. Ferguson's definition of the Dirichlet process focussed on 
the (random) probabilities to be assigned to arbitrary partitions (Section 
12. ip . As we have seen, the resulting distributions G are almost surely 
discrete. An effective way to modify this process to control continuity 
properties is to limit the partitions to which elementary probabilities 
are assigned, and in the case of Polya tree processes this is achieved by 
imposed a fixed binary partition of f2, and assigning probabilities to suc- 
cessive branches in the tree through independent Beta distributions. The 
parameters of these distributions can be set to obtain various degrees 
of smoothness of the resulting G. This appro ach, ess e ntially begin ning 
wit h Ferguson himself, has been pursued by Lavinel ( 1992 . 1994 ): see 



also Walker et al. ( 199 



4 Polya urn schemes and MCMC samplers 



There is a huge literature on Markov chain Monte Carlo methods for pos- 

i 1 1 1 1 1 

terior sampling in Dirich l et mixture mode ls (MacEachern, 1994; Escobar and West 



1995: lMiiller et allll996tlMacEachern and Mullerlll998tlNeall2000l:lGreen and Richardson 
2001). Although these models have 'variable dimension', the posteri- 
ors can be sa mpled without necessarily using reversible jump methods 
( Greenl[l995l) . 

Cases where Go is not conjugate to the data model f{-\4>) demand 
keeping in the state vector, to be handled through various aug- 
mentation or reversible jump schemes. In the conjugate case, however, 
it is obviously appealing to integrate (f> out, and target the Markov chain 
on the posterior solely of the partition, generating cf> values subsequently 
as needed. To discuss this, we first go back to probability theory. 
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4.1 The Polya urn representation of the Dirichlet 

process 

The Polya urn is a simple and well-known discrete probability model for 
a reinforcement process: coloured balls are drawn sequentially from an 
urn; after each is drawn it is replaced, together with a new ball of the 
same colour. This idea can be seen in a generalised form, in a recursive 
definition of the joint distribution of the {4>i}- 
Suppose that for each n = 0, 1, 2, . . . , 

1 " 9 

(j) n+ l | (j)l,(j)2 ,...,(f> n ~ — y^^i H T^ G 0, (4-1) 



where 8 > 0, Go is an a rbitrary probability distrib u tion, and 6$ is a point 



probability mass at (j). iBlackwell and MacQueenl ([19731 ) termed such a 



sequence a Polya sequence; they showed that the conditional distribu- 
tion on the right hand side of (I4.1[) converges to a random probability 
distribution G distributed as DP( 9,Gq), and that , give n G, 4>\, <t>2 , ■ ■ ■ 
are i.i.d. distributed as G. See also I Antoniakl (|1974h and lPitman (|l995h . 

Thus we have yet another approach to defining the Dirichlet process, 
at least in so far as specifying the joint distribution of the {4>i} is con- 
cerned. This representation has a particular role, of central importance 
in computing inferences in DP models. This arises directly from (14.11) 
and the exchangeability of the {<fii}, for it follows that 



(4.2) 



where means {<pj : j = 1, 2, . . . ,n, j i}. In this form, the state- 
ment has an immediate role as the full conditional distribution for each 
component of (</>i)™ =1 , and hence defines a Gibbs sampler update in a 
Markov chain Monte Carlo method aimed at this target distribution. By 
conjugacy this remains true, with obvious changes set out in the next 
section, for posterior sampling as well. 

The Polya urn representation of the Dirichlet process has been the 
point of departure for yet another class of pro bability models, namely 
species sampling models ( Pitman . 1995 . 199 ^1. that are beginn ing to 
find a use in statistical methodology (jlshwaran and Jamesl . 12003 ). 
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4.2 The Gibbs sampler for posterior sampling of 
allocation variables 



We will consider posterior sampling in the conjugate case in a more gen- 
eral setting, specialising back to the Dirichlet process mixture case later. 
The set-up we will assume is based on a partition model: it consists of 
a prior distribution p(c\9) on partitions c of {1,2, ... ,n} with hyper- 
parameter 9, together with a conjugate model within each cluster. The 
prior on the cluster-specific parameter <f>j has hyperparameter ip, and 
is conjugate to the likelihood, so that for any subset C C {1,2,..., n}, 
p(Yc\ip) is known explicitly, where Yc is the subvector of (li)f =1 with 
indices in C. We have 



We first consider only re-allocating a single item at a time (the single- 
variable Gibbs sampler for Ci). Then repeatedly we withdraw an item, 
say i, from the model, and reallocate it to a cluster according to the full 
conditional for c;, which is proportional to p(c|Y", 9, ip). It is easy to see 
that we have two choices: 

• allocate Yi to a new cluster C*, with probability 



where c 1 ^* denotes the current partition c with i moved to C*, or 



where c?~*i denotes the partition c, with i moved to cluster Cj. 

The ratio of marginal likelihoods p(Y |?/>) in the second expression can 

be interpreted as the posterior predictive distribution of Yi given those 

observations already allocated to the cluster, i.e. p(Yi\Y c -i, ip) (a mul- 

j ____ 

tivariate t for the Normal-inverse gamma set-up from Section [3.2[1 . 
For Dirichlet mixtures we have, from (12.11) . 




ocp(c^*\9) Xp(Yi\tp), 




|V0Mv cr #), 

J 



T(8) 



(i 



p(c\9) 



T(9 + n) 



8 d H(n j -l)\, 



where nj = jj^Cj and c = (Ci, C2, . . . , Cd), so the re-allocation probab- 
ilities are explicit and simple in form. 



14 



Peter J. Green 



But the same sampler can be used for many other partition models, 
and the idea is not limited to moving one item at a time. 

4.3 When the Polya urn sampler applies 

All we require of the model for the Polya urn sampler to be available for 
posterior simulation are that 

1. a partition c of {1, 2, . . . , n} is drawn from a prior distribution with 
parameter 9] 

2. conditionally on c, parameters (<pi, <j>2, ■ ■ • , 4>d) are drawn independ- 
ently from a distribution Go (possibly with a hyperparameter ip); 

3. conditional on c and on (f> = (4>i, 4>2, ■ ■ ■ , <t>d), {yi, 2/2, ■ • ■ , Un} are 
drawn independently, from not necessarily identical distributions 
p(yi\c,4>) = fi{yi\4>j) for i £ Cj, for which Go is conjugate. 

If these all hold, then the Polya urn sampler can be used; we see from 
Section 221 that it will involve computing only marginal likelihoods, and 
ratios of the partition prior, up to a multiplicative constant. The first 
factor depends only on Go and the likelihood, the second only on the 
partition model. 

Examples. p(c l ^*\9) and p(c l_>J |0) are proportional simply to 

• 9 and #G~ l for the DP mixture model, 

• (k— d(c~ l ))8 and 4t z C~ l +8 for the Dirichlet-multinomial finite mixture 
model, 

• 9+ad(c~ l ) and #C~ l — a for the Kingman-Pitman-Yor two-paramet- 
er Poisson-Dirichlet process (Section 13.31) . 

It is curious that the ease of using the Polya urn sampler has often 
been cited as motivation to use Dirichlct process mixture models, when 
the class of models for which it is equally readily used is so wide. 

4.4 Simultaneous re-allocation 

There is no need to restrict to updating only one at a time: the idea 
extends to simultaneously re-allocating any subset of items currently in 
the same cluster. 

The notation can be rather cumbersome, but again the subset forms 
a new cluster, or moves to an existing cluster, with relative probabilities 
that are each products of two terms: 
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• the relative (new) partition prior probabilities, and 

• the predictive density of the moved set of item data, given those 
already in the receiving cluster. 

A more sophisticated var iant on this scheme has been proposed by 
Nobile and Fearnside! ( 2007 ). and studied in the case of finite mixture 
models. 



5 A coloured Dirichlet process 

For the remainder of this note, we focus on the use of these models 
for clustering, rather than density estimation or other kinds of infer- 
ence. There needs to be a small caveat — mixture models are commonly 
used either for clustering, or for fitting non-standard distributions; in 
a problem demanding both, we cannot expect to be able meaningfully 
to identify clusters with the components of the mixture, since multiple 
components may be needed to fit the non-standard distributional shape 
within each cluster. Clustered Dirichlet process methodology in which 
there is clustering at two levels that can be used for such a purpose 
is under development by Dan Merl and Mike West at Duke (personal 
communication) . 

Here we will not pursue this complication, and simply consider a mix- 
ture model used for clustering in the obvious way. 

In many domains of application, practical considerations suggest that 
the clusters in the data do not have equal standing; the most common 
such situation is where there is believed to be a 'background' cluster, and 
one or several 'foreground' clusters, but more generally, we can imagine 
there being several classes of cluster, and our prior beliefs are represented 
by the idea that cluster labels are exchangeable within these classes, but 
not overall. It would be common, also, to have different beliefs about 
cluster-specific parameters within each of these classes. 

In this section, we present a variant on standard mixture/cluster mod- 
els of the kinds we have already discussed, aimed at modelling this situ- 
ation of partial exchangeability of cluster labels. We stress that it will 
remain true that, a priori, item labels are exchangeable, and that we 
have no prior information that particular items are drawn to particular 
classes of cluster; the analysis is to be based purely on the data {Yi}. 

We will describe the class of a cluster henceforth as its 'colour'. To 
define a variant on the DP in which not all clusters are exchangeable: 
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1. for each 'colour' k — 1, 2, ... , draw Gk from a Dirichlet process 
DP(9k,Gok), independently for each k; 

2. draw weights (wk) from the Dirichlet distribution Dir(7i,7 2 , . . .), in- 
dependently of the Gfc; 

3. define G on {k} x fl by G(k, B) = w k G k {B); 

4. draw colour-parameter pairs (ki, <pi) i.i.d. from G, i = 1, 2, . . . , n. 

This process, denoted CDP({(7fc, 9k, Gok)}), is a Dirichlet mixture of 
Dirichlet processes (with different base measures), ^ fe WfcDP(#fc, Gofe), 
with the added feature that the the colour of each cluster is identified 
(and indirectly observed), while labelling of clusters within colours is 
arbitrary. 

It can be defined by a 'stick-breaking-and-colouring' construction: 

1. colour segments of the stick using the Dirichlet({7fc})-distributed 
weights; 

2. break each coloured segment using an infinite sequence of independent 
Beta(l,#/c) variables Vj k ; 

3. draw 4>* k ~ G ofc , i.i.d., j = 1, 2, . . . ; k = 1, 2, . . . ; 

4. define Gk to be the discrete distribution putting probability (1 — 



Note that in contrast to other elaborations to more structured data 
of the Dirichlet process model, in which the focus is on nonparametric 
analysis and more sharing of information would be desirable, here, where 
the focus is on clustering, we are content to leave the atoms and weights 
within each colour completely uncoupled a priori. 

5.1 Coloured partition distribution 

The coloured Dirichlet process (CDP) generates the following partition 
model: partition {1,2, ... ,n} = UfeUj=iCfcj at random, where G\j is 
the jth cluster of colour k, so that 



where n k j = #C k j, n k = J2j n kj- 

It is curious to note that this expression simplifies when 9k = jk, 
although such a choice seems to have no particular significance in the 



V lk )(l - V 2k ) ■••(!- Vj- llk )V jk on <p* jk . 



p(Cn, C12, . . . , Cidi \ CW, • • • , c 2 d 2 ; C31, . . .) — 
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probabilistic construction of the model. Only when it is also true that 
the 8k are independent of k (and the colours are ignored) does the model 
degenerate to an ordinary Dirichlct process. 

The clustering remains exchangeable over items. To complete the con- 
struction of the model, analogously to Section 12.41 for i G Cjy , we set 
ki = k and fa = fa-, where are drawn i.i.d. from Gofc- 

5.2 Polya urn sampler for the CDP 

The explicit availability of the (coloured) partition distribution immedi- 
ately allows generalisation of the Polya-urn Gibbs sampler to the CDP. 

In reallocating item i, let denote the number among the remain- 
ing items currently allocated to Ckj, and define n k l accordingly. Then 
reallocate i to 

• a new cluster of colour k, with probability oc 9k x (7^ + + 
n^)xp(Yi|V0,forfc = l,2,... 5 

• the existing cluster Ckj, with probability oc njy x (7^. + n k ~ l )/(9k + 
n~ k l ) x P (Yi\Y c -i, V), for 3 = 1, 2, . . . , n?; k = 1, 2, . . . . 

Again, the expressions simplify when 9k = 7fe. 

5.3 A Dirichlet process mixture with a background 

cluster 

In many applications of probabilistic clustering, including the gene ex- 
pression example from Section 13.21 it is natural to suppose a 'back- 
ground' cluster that is not a priori exchangeable with the others. One 
way to think about this is to adapt the 'limit of finite mixtures' view 
from Section |2~31 

1. draw (wq, w\, W2, • • ■ , Wk) ~ Dirichlet(7, S, . . . , 5); 

2. draw Ci £ {0, 1, . . . , k} with P{cj = j} = Wj, i.i.d., i = 1, . . . , n; 

3. draw - F > 0} ~ Go, i-i-d., j = 1, . . . , fc; 

4. set 0j = 0* . 

Now let k — > 00, 5 — >• such that fc5 — > 0, but leave 7 fixed. The cluster 
labelled represents the 'background'. 

The background cluster model is a special case of the CDP, specifically 
CDP({(7, 0, H ), (9, 9, Go)}). The two colours correspond to the back- 
ground and regular clusters. The limiting-case DP(0, Hq) is a point mass, 
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randomly drawn from Hq. We can go a little further in a regression set- 
ting, and allow different regression models for each colour. 

The Polya urn sampler for prior or posterior simulation is readily 
adapted. When re-allocating item i, there are three kinds of choice: a 
new cluster C\, the 'top table' Co, or a regular cluster Cj,j ^ 0: the 
corresponding prior probabilities p(c l ^*\8), p(c l ^°\8) and p{c % ~^i\9) are 
proportional to 0, (7 + #C -4 ) and #CJ % for the background cluster 
CDP model. 



5.4 Using the CDP in a clustered regression model 

As a practical illustration of the use of the CDP background cluster 
model, we discuss a regression set-up that expresses a vector of meas- 
urements yi = (yu, . . . ,2/is) for i = 1, . . . , n, where S is the number 
of samples, as a linear combination of known covariates, (zi • • • zg) with 
dimension K' and (xi • • -xg) with dimension K. These two collections 
of covariates, and the corresponding regression coefficients 5j and /3 -, 
are distinguished since we wish to hold one set of regression coefficients 
fixed in the background cluster. We assume 
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= [z 1 ---z s ]'6 j + [ Xl ---x s }'P j +e j (5.1) 

where e 3 ~ N(0s x i, T j" 1 Isxs), 0,sxi is the S'-dimension zero vector 
and IsxS is the order-S* identity matrix. Here, Sj, f3j and Tj are cluster- 
specific parameters. The profile of measurements for individual i is = 
[yu ' • ' Vis)' for i = 1, . . . , n. Given the covariates z s — [z s i • ■ • z s k>}' , 
x s = [x s \ ■ ■ ■ x s k}' , and the cluster j, the parameters/latent variables 
are Sj — [5j\ ■ ■ ■ SjK 1 ]' , /3j = [13 ji • ■ • Pjk} 1 and Tj. The kernel is now 
represented as k(yi\Sj, [3j, Tj), which is a multivariate Normal density, 
AT([zi • • • zs]'Sj + [xi • • • xs]'(3j, T~ 1 Is X 5). In particular, we take differ- 
ent probability measures, the parameters of heterogeneous DP, for the 
background and regular clusters, 

u = (5 ,f3 ,T ) - H (dS ,df3 ,dT ) 

= 5fi o (dd a ) x Normal-Gamma(d/3 , dr _1 ); 
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uj = (5j,/3j,Tj) ~ GaidS^dPpdTj) 

= Normal-Gamma(rf(5j ,fij)', dr^ 1 ) 
for j = 1, . . . , n(p) - 1. 

Here Ho is a probability measure that includes a point mass at Sq and 
a Normal-Gamma density for /3 and Tq -1 . On the other hand, we take 
Go to be a probability measure that is a Normal-Gamma density for 
(S'jjfl'j)' and tJ x . Thus the regression parameters corresponding to the 
z covariates are held fixed at So m the background cluster, but not in 
the others. 

We will first discuss the marginal distribution for the regular clusters. 
Given tj, (5'j,(3'j)' follows the (K' + K )-dimensional multivariate Nor- 
mal with mean m and variance (T^t) -1 and Tj follows the univariate 
Gamma with shape a and scale b. We denote the joint distribution 
Go{d(S'j, j3'j)' 7 drj) as a joint Gamma and Normal distribution, Normal- 
Gamma^, b, m, t), and further we take 



m = 


m 5 


and t = 


t s 0" 








.0 t p _ 



(5.2) 



Based on this set-up, we have 
m Go {y C] ) = 

t 2 ~ a {Y C] \Z Cj m s + X c .m^, l(Z Cj t^Z' c + X c .t^X' c + l ejS ^s)), 

a 1 

(5.3) 

where Y C] = [y' n ■ ■ ■ y' ie J, X Cj = [[xi • • • x s ] • • • [xi • • • x s ]]' and Z C] = 
[[zi • • -zg] • • • [zi • • -zs]]' for Cj — {ii, . ..,i ej }. Note that Yc. is a ejS 
vector, is a ejS x K' matrix and X^ is a ejS x K matrix. Moreover, 
m Go(yCj) is a multivariate t density with mean Zemins -fX^m^, scale 

^(Z Cj tJ 1 Z' Cj + X Cj t~ 1 X' Cj + I ej sxe J s)) 

and degree of freedom 2a. 

For the background cluster, we take H to be a joint Gamma and Nor- 
mal distribution, Normal-Gamma(a, b, m^, tp). The precision To follows 
the univariate Gamma with shape a and scale b. Given To, f3 follows the 
X-dimension multivariate Normal with mean and variance (T t^) _1 
and tq follows the univariate Gamma with shape a and scale b. The mar- 
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ginal distribution becomes 



m Ho {yc ) = ^2q(Yc 3 \Z Cj S + X Cj mp, -(X Cj .t^ X' Cj + I ej sxejs))- 

. (5-4) 

So rnH {yCo) is a multivariate t density with mean TiCjSq + Xcr.m^, 
scale 

= ( X C 3 tg 1 X'cr. + I ej S x e 3 S ) 

and degree of freedom 2a. 

In some applications, the :rs and /3s are not needed and so can be 
omitted, and we consider the following model, 



[zi • • • z s ]'6j + e,-; 



(5.5) 

here we assume that K = or [xi • • • xg]' = Osx a where Og X A is the 
S y. K matrix with all zero entries of the model (|5.1[) . We can derive the 
marginal distributions analogous to (|5.3D and (|5.4I) . 
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m Go (yc 3 ) =i2H(Y C3 |Z C3 rii 5 ,^(Z C3 t7 1 Z' Cj . +I ej sxe 3 s)), (5.6) 



Wl£fo(yC ) = *2a(Y C , |Z C ,<5 , -I e ,Sxe,s)- 

a 



(5.7) 



Here t v (x S ) is a multivariate i density in d dimensions with mean 
fi and scale S with degrees of freedom v\ 



r((i/)/2) (t/vr) d / 2V i/ 



(5.8) 



5.5 Time-course gene expression data 

We consider the application of th is methodo l ogy t o data from a gene 



expression time-course experiment. I Wen et a l. (1998) studied the c entral 



nervous system development of the rat; see also lYeung et al.l ([20011 ). The 
mRNA expression levels of 112 genes were recorded over the period of 
development of the central nervous system. In the dataset, there are 9 
records for each gene over 9 time points; they are from embryonic days 
11, 13, 15, 18, 21, postnatal days 0, 7, 14, and the 'adult' stage (postnatal 
day 90). 
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B/g cluster (size: 21 ) Cluster 1 (size: 17 ) 
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Fi gure 5.1 Profile p lot of our partition estimate for the Rat data set 
of lWen et all l| 19981 ). 



In their analysis. IWen et al.l ([19981 ) obtained 5 clusters/waves (totally 
6 clusters), taken to characterize distinct phases of development. The 



data set is available at http : //faculty .Washington . edu/kayee/ cluster/GEMraw . txt 

9 and K' 



We take S 
be 



5. The design matrix of covariates is taken to 
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representing piecewise linear dependence on time, within three separate 
phases (embryonic, postnatal and adult). 

In our analysis of these data, we take = 1, 7 = 5, a = a = 0.01, 
6 = 6 = 0.01, m s = m s = [0---0]', t s = t s = 0.011, ih = [0---0]', 
tp = 0.011 and 8q = [0 • ■ ■ 0]'. The Polya urn sampler was implemented, 
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and run for 20000 sweeps starting from the partition consisting of all 
singleton clusters, 10000 being disc arded as burn-in. We then use the 
last 10000 partitions sampled as in lLau and Greenl (|2007l ). to estimate 
the optimal Bayesian partition on a decision-theoretic basis, using a 
pairwise coincidence loss function that equally weights false 'positives' 
and 'negatives'. 

We present some views of the resulting posterior analysis of this data 
set. 
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Figure 5.2 Mean and 95% CI of genes across clusters of our partition 
estimate. 



Figure l5~Tl shows the profiles in the inferred clusters plotted, and Fig- 
ure 15.21 the mean and the 95% CI of the clusters. Figure 15.31 cross- 
tabulates the clu sters with t he bi ological functions attributed to the 
relevant genes by Wen et al. (|l998l) . 
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General gene Class Neurotransmitter receptors 

Ligand class Sequence class 

% peptide % neurotr. % neuroglial % diverse % ion % G protein 

% ACh % GABA % Glu % 5HT 

signaling receptors markers channel coupled 
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Figure 5.3 Biological func tions of our Bayes ian partition estimate for 
the genes in the data set of lWen et alJ ljl998f ) , showing correspondence 
between inferred clusters and the functional categories of the genes. 
All genes are classified into 4 general gene classes. Additionally, the 
Neurotransmitter genes have been further categorised by ligand class 
and functional sequence class. Boldface type represents the dominant 
class in the cluster, in each categorisation. 

expression data, and John Kingman for his undergraduate lectures in 
Measure and Probability. 
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